EXHIBIT 2



EAFX Proposal 
Draft: 



1) Reiterate EAFX Project Goal: 

Use Ingenuity's content to identify connections between expression results and 
biological pathways that do not appear to be the product of random chance. 



2) Suggest range of possible solutions. 

From simple to futuristic. 



Difficulty          Solution
------------------  ---------------------------------------------------------

Basic               Simple relationships between genes.
(Implementable)     Take a set of genes -> identify all direct facts linking
                    the genes.
                    Identify the largest connected groupings.
                    Identify links with lots of facts.

Realistic but       Unsupervised approach to identifying clusters.
riskier than        Develop algorithms that can automatically identify
supervised          functional clusters based on correlations between user
approach.           genes and knowledge connectedness/densities in our KB.

Futuristic          Create a virtual model of tissue/disease-specific cells
(Science fiction)   using expanded Ingenuity structured content (scientific
                    literature, genomics data, bioinformatics data, canonical
                    knowledge, user knowledge, pre-existing analysis).
                    Develop sophisticated algorithms that predict behavior
                    and that identify mechanistic explanations for
                    dysregulated pathways.
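
As an illustration of the "Basic" row only, the following is a minimal Python sketch, not part of the proposal itself. It assumes that the knowledge base can be reduced to a list of (gene, gene) fact pairs; the function names and the placeholder gene names G1..G5 and GX are hypothetical.

```python
from collections import Counter, defaultdict, deque

def direct_links(genes, facts):
    """Count facts per gene pair, keeping only pairs inside the input set."""
    gene_set = set(genes)
    counts = Counter()
    for a, b in facts:
        if a != b and a in gene_set and b in gene_set:
            counts[frozenset((a, b))] += 1  # direction of the fact is ignored
    return counts

def connected_groupings(genes, link_counts):
    """Return groups of genes joined by direct links, largest group first."""
    adjacency = defaultdict(set)
    for pair in link_counts:
        a, b = tuple(pair)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in genes:
        if start in seen:
            continue
        seen.add(start)
        group, queue = set(), deque([start])
        while queue:  # breadth-first walk over one connected grouping
            node = queue.popleft()
            group.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(group)
    return sorted(groups, key=len, reverse=True)

# Placeholder input: five user genes and a handful of made-up facts.
genes = ["G1", "G2", "G3", "G4", "G5"]
facts = [("G1", "G2"), ("G2", "G1"), ("G1", "G2"), ("G2", "G3"), ("G4", "GX")]
links = direct_links(genes, facts)
groups = connected_groupings(genes, links)
```

Here `groups[0]` is the largest connected grouping and `links.most_common()` lists the links backed by the most facts. Facts that involve a gene outside the input set (GX above) are dropped.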






3) Define improvement axes (biologically believable, significance likelihood,
decision relevance)

The value of our product increases when the "pathways" generated by our analysis
improve along one or more of these axes for the user.



1) Biologically believable: The results are consistent with the user's understanding of
biology (e.g., fill in an example).

2) Significance likelihood: The results do not appear to be the product of random chance
(unique, unexpected, specific to the user's input, correlated with the input).
For example, most of the genes upregulated by a specific transcription factor are among
the input genes.

3) Decision relevance: The results are applicable to the user's decision-making process,
e.g., interesting drug discovery traits:

a. Uniqueness/Novelty 

b. Patent 

c. Tissue Specificity (Link to body atlas) 

d. Toxicity 

e. Disease 



4) Possible features that would "improve" performance/functionality. 



INPUT 
CONTENT 

ALGORITHMS/SCORING 






Improvement                 Biologically  Significance  Decision   Other Notes
                            Believable    Likelihood    Relevance
--------------------------  ------------  ------------  ---------  -------------------------------

INPUT
  List of Genes
    Cluster                 Baseline      Baseline      Baseline
    All measured                          +++           ++
    Cluster membership      +             +++           +++        Assumes belief that clusters
                                                                   have significance
  Expression values
    Dir of Change           +++           ++
    Quantity (1 exp)        ++            ++
    Quantity over time      ++++          +++
      (Time Series)
  Experimental Context
    Dysregulated Genes      +++           ++            ++         Knockouts, overexpression
    Cell/Tissue Source      +++           ++            ++         Includes expression specificity
    Cell/Tissue disease
      state
    Cell/Tissue Treatment
      (small molecule,
      irradiation)

CONTENT
  Kb Objects
    Unspecified             Baseline      Baseline      Baseline
    Mutant vs Wildtype
    Localization
    Active vs inactive
      state
    Complex vs unbound
    Species specificity





KB Processes 
Molecular 
Modification 
Complex formation 

Confidence of link 
Negated information 
Coupling (direct vs indirect)

Fact type 
Structure 

Disease correlation 



Include simple example/output that this would allow.

5) Internal recommendation:

Define baseline proposal (in addition helps us better understand the system) 



Difficulty          Solution
------------------  ---------------------------------------------------------

Realistic           Supervised approach to identifying clusters.
Requires            Use expert/algorithmic rules to generate potentially
                    meaningful biological profiles. Scan the user's genes
                    against all profiles to identify interesting mechanisms.
                    Refine the profiles based on the user's particular genes.

Realistic but       Unsupervised approach to identifying clusters.
riskier than        Develop algorithms that can automatically identify
supervised          functional clusters based on correlations between user
approach.           genes and knowledge connectedness/densities in our KB.
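
The scanning step of the supervised approach could look roughly like the Python sketch below. It is illustrative only: it assumes each profile is simply a named set of genes, ranks profiles by raw overlap, and leaves out the refinement step. `scan_profiles`, the profile names P1..P3 and the gene names are hypothetical.

```python
def scan_profiles(user_genes, profiles):
    """Rank profiles by how many of the user's genes they contain.

    Returns a list of (profile name, overlapping genes, fraction of the
    profile covered), with the largest overlap first.
    """
    user = set(user_genes)
    hits = []
    for name, members in profiles.items():
        member_set = set(members)
        overlap = user & member_set
        if overlap:  # profiles with no overlap are not reported
            hits.append((name, sorted(overlap), len(overlap) / len(member_set)))
    # Sort by overlap size, then by covered fraction, then by name for stability.
    hits.sort(key=lambda hit: (-len(hit[1]), -hit[2], hit[0]))
    return hits

# Placeholder profiles and user genes.
profiles = {
    "P1": {"G1", "G2", "G3"},
    "P2": {"G4", "G5"},
    "P3": {"G9"},
}
hits = scan_profiles(["G1", "G2", "G4"], profiles)
```

Raw overlap counts favor large profiles, so a real ranking would need a chance-corrected score on top of this.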



EXHIBIT 3 



I worked out the probability (not p-value) calculation for a match under the null hypothesis. It is most
strongly affected by the overlap (the number of 'significant' user genes that are also KB genes in a
particular BCP). I implemented a calculation optimized for machine precision in Perl and checked it into
the eafx/scripts directory as 'random_match_prob.pl'. Please read below (also in the source file) for details.

Dan 
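
For readers who want to experiment with the numbers, here is a minimal Python sketch of the P(OVP) formula documented in the script comments that follow. It is not the checked-in Perl script. The argument names mirror the documented inputs (`map_` stands in for MAP), and the use of `math.lgamma` for the log-factorials is an assumption based on the comments' note that n! = GAMMA(n+1).

```python
from math import comb, exp, lgamma

def log_factorial(n):
    """log(n!) computed as log(GAMMA(n + 1))."""
    return lgamma(n + 1)

def p_overlap(sig, ovp, map_, kb, bcp):
    """P(OVP) evaluated in log space to avoid overflowing large factorials."""
    if not (0 <= ovp <= min(sig, bcp) and sig <= map_ <= kb and bcp <= kb):
        raise ValueError("inputs violate the documented invariants")
    log_p = (
        log_factorial(sig) + log_factorial(bcp)
        + log_factorial(kb - ovp) + log_factorial(map_ - ovp)
        - log_factorial(sig - ovp) - log_factorial(bcp - ovp)
        - log_factorial(kb) - log_factorial(map_)
    )
    return exp(log_p)

def p_overlap_direct(sig, ovp, map_, kb, bcp):
    """Non-optimized equivalent built from binomial coefficients."""
    return (comb(sig, ovp) * comb(kb - ovp, bcp - ovp)) / (
        comb(map_, ovp) * comb(kb, bcp)
    )

# Small hand-checkable case: (2 * 3) / (4 * 6) = 0.25
example = p_overlap(sig=2, ovp=1, map_=4, kb=4, bcp=2)
```

The two functions should agree to within floating-point rounding. The log-space version is the one that stays usable when KB and MAP run into the thousands.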



# Compute probability of getting BCP match by chance for null hypothesis
# of BCP generated randomly.
#
# Dan Richards
# [DATE REDACTED]
#
# Inputs:
#   SIG - number of significant user genes that are mapped to KB genes
#   OVP - number of (significant) user genes that overlap the KB genes
#         in the BCP
#   MAP - total number of user genes assayed (not necessarily significant)
#         that are mapped to KB genes.
#   KB  - number of KB genes (which could appear in a BCP, i.e. have
#         suitable content)
#   BCP - number of KB genes in the BCP
#
# Formula for significance:
#
#   1. P(USER_OVP) = probability that the particular number of overlapping
#                    genes occur in the user's data set
#                  = Choose(SIG, OVP) / Choose(MAP, OVP)
#   2. P(BCP_OVP)  = probability that the particular number of overlapping
#                    genes occur in the BCP
#                  = (Choose(OVP, OVP) * Choose(KB-OVP, BCP-OVP)) / Choose(KB, BCP)
#                  = Choose(KB-OVP, BCP-OVP) / Choose(KB, BCP)
#                    Note: Choose(OVP, OVP) = 1
#   3. P(OVP)      = P(USER_OVP) and P(BCP_OVP)
#                  = P(USER_OVP) * P(BCP_OVP)
#                  = (Choose(SIG, OVP) * Choose(KB-OVP, BCP-OVP)) /
#                    (Choose(MAP, OVP) * Choose(KB, BCP))
#

# Implications:
#   1. For a fixed set of KB genes, and a fixed number of SIGnificant user genes:
#      a. The larger the BCP, the MORE likely the match occurred by chance
#      b. The larger the OVP, the LESS likely the match occurred by chance
#   2. For a fixed number of OVP genes, and a fixed size of the matched BCP:
#      a. The larger the SIG, the MORE likely the match occurred by chance
#      b. The larger the KB, the LESS likely the match occurred by chance
#   3. If BCP=KB, then if there is any overlap, P(BCP_OVP) is unity (1),
#      so P(OVP) reduces to P(USER_OVP).
#   4. If SIG=KB (and therefore SIG=MAP=KB), then if there is any overlap,
#      P(USER_OVP) is unity, since this implies that every gene in the KB is
#      also a significant user gene, so every match is expected; P(OVP) then
#      reduces to P(BCP_OVP).
#   5. If MAP<KB, then P(OVP) is greater (MORE likely) than if MAP=KB
#
# So overall, P(OVP) is minimized (LEAST likely) if (in decreasing likelihood):
#   KB >> BCP, OVP >> 1, MAP=KB, BCP=OVP, SIG=OVP






# Note: an overlap of more than 1 to a BCP with more than 1 gene is MUCH
# less probable than an overlap of 1 to a BCP with only 1 gene.
#
# Invariants:
#   KB >= 0
#   BCP <= KB
#   MAP <= KB
#   SIG <= MAP
#   OVP <= BCP, OVP <= SIG

sub p_overlap {
    # Computes p(OVP) result to highest possible machine precision:
    #
    # P(OVP) formula simplifies to:
    #   (SIG! * BCP! * (KB-OVP)! * (MAP-OVP)!) /
    #   ((SIG-OVP)! * (BCP-OVP)! * KB! * MAP!)
    #
    # Note:
    #   n! = GAMMA(n+1)
    #
    # Uses log() to maintain highest possible numerical machine precision
    #
    # Non-optimized (equivalent) formula:
    # return (choose($sig,$ovp)*choose($kb-$ovp,$bcp-$ovp))/(choose($map,$ovp)*choose($kb,$bcp));